PostgreSQL Page页结构解析 B-Tree索引Special Space 存储结构
本文简单介绍了在PG数据库B-Tree索引的物理存储结构中Special space部分,包括根节点、左右兄弟节点等相关索引结构信息,以及初步探讨了PG在物理存储上如何保证索引的有序性。
环境准备
测试数据同上一节,索引文件raw data:
hexdump -C $PGDATA/base/13758/16441
00000000 00 00 00 00 08 9f 79 01 00 00 00 00 48 00 f0 1f |......y.....H...|
00000010 f0 1f 04 20 00 00 00 00 62 31 05 00 04 00 00 00 |... ....b1......|
00000020 01 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f0 bf |................|
00000040 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00001ff0 00 00 00 00 00 00 00 00 00 00 00 00 08 00 00 00 |................|
00002000 00 00 00 00 78 9e 79 01 00 00 00 00 28 00 b0 1f |....x.y.....(...|
00002010 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00 |... ...... ... .|
00002020 c0 9f 20 00 b0 9f 20 00 b0 9f 20 00 00 00 00 00 |.. ... ... .....|
00002030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00003fb0 00 00 00 00 04 00 10 00 10 00 00 00 00 00 00 00 |................|
00003fc0 00 00 00 00 03 00 10 00 08 00 00 00 00 00 00 00 |................|
00003fd0 00 00 00 00 02 00 10 00 04 00 00 00 00 00 00 00 |................|
00003fe0 00 00 00 00 01 00 10 00 02 00 00 00 00 00 00 00 |................|
00003ff0 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 |................|
00004000
索引结构信息
索引物理存储结构在上一节已大体介绍,这里主要介绍索引的结构信息。通过pageinspect插件的bt_page_stats函数可以获得索引结构信息,包括root/leaf page,next & previous page:
1 定位root_page
SELECT * FROM bt_metap('pk_t_index');
magic | version | root | level | fastroot | fastlevel | last_cleanup_num_delpages | last_cleanup_num_tuples | allequalimage
--------+---------+------+-------+----------+-----------+---------------------------+-------------------------+---------------
340322 | 4 | 1 | 0 | 1 | 0 | 0 | -1 | t
(1 row)
2 查看SpecialSpace.stats
postgres=# select * from bt_page_stats('pk_t_index',1);
blkno | type | live_items | dead_items | avg_item_size | page_size | free_size | btpo_prev | btpo_next | btpo_level | btpo_flags
-------+------+------------+------------+---------------+-----------+-----------+-----------+-----------+------------+------------
1 | l | 4 | 0 | 16 | 8192 | 8068 | 0 | 0 | 0 | 3
(1 row)
相关的数据结构如下:
//---------------------- src/include/access/nbtree.h
typedef struct BTPageOpaqueData
{
BlockNumber btpo_prev; /* left sibling, or P_NONE if leftmost */
BlockNumber btpo_next; /* right sibling, or P_NONE if rightmost */
uint32 btpo_level; /* tree level --- zero for leaf pages */
uint16 btpo_flags; /* flag bits, see below */
BTCycleId btpo_cycleid; /* vacuum cycle ID of latest split */
} BTPageOpaqueData;
typedef BTPageOpaqueData *BTPageOpaque;
/* Bits defined in btpo_flags */
#define BTP_LEAF (1 << 0) /* leaf page, i.e. not internal page */
#define BTP_ROOT (1 << 1) /* root page (has no parent) */
#define BTP_DELETED (1 << 2) /* page has been deleted from tree */
#define BTP_META (1 << 3) /* meta-page */
#define BTP_HALF_DEAD (1 << 4) /* empty, but still in tree */
#define BTP_SPLIT_END (1 << 5) /* rightmost page of split group */
#define BTP_HAS_GARBAGE (1 << 6) /* page has LP_DEAD tuples (deprecated) */
#define BTP_INCOMPLETE_SPLIT (1 << 7) /* right sibling's downlink is missing */
#define BTP_HAS_FULLXID (1 << 8) /* contains BTDeletedPageData */
查询结果说明
输出 | 说明 |
---|---|
type | l(字母l),表示这个block(page)是leaf block |
live_items | 在这个block中有4个item(live_items=4), |
dead_items | 没有废弃的items(dead_items=0), |
left sibling | 没有left sibling(btpo_prev =0 |
right sibling | 也没有right sibling(btpo_next =0) |
btpo | btpo是一个union,值为0,表示该page为叶子page, |
btpo_flags | 3即BTP_LEAF 或 BTP_ROOT,既是叶子page也是根page。 |
Warning
1.这些信息物理存储在先前介绍过的PageHeader中的pd_special指向的物理位置中,pd_special共占用2个字节:
2. Special Space 占用16 个字节。
物理位置分析
1 special space 计算位置
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/DB0E5C98 | 0 | 0 | 40 | 8112 | 8176 | 8192 | 4 | 0
(1 row)
testdb=# select 8192+8176;
?column?
----------
16368
(1 row)
[xdb@localhost utf8db]$ hexdump -C $PGDATA/base/13758/16441 -s 16368 -n 16
00003ff0 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 |................|
00004000
2 btpo_prev
hexdump -C $PGDATA/base/13758/16441 -s 16368 -n 4
[postgres@node1 ~]$ hexdump -C $PGDATA/base/13758/16441 -s 16368 -n 4
00003ff0 00 00 00 00 |....|
00003ff4
3 btpo_next
hexdump -C $PGDATA/base/13758/16441 -s 16372 -n 4
[postgres@node1 ~]$ hexdump -C $PGDATA/base/13758/16441 -s 16372 -n 4
00003ff4 00 00 00 00 |....|
00003ff8
4 btpo_level
hexdump -C $PGDATA/base/13758/16441 -s 16376 -n 4
[postgres@node1 ~]$ hexdump -C $PGDATA/base/13758/16441 -s 16376 -n 4
00003ff8 00 00 00 00 |....|
00003ffc
5 btpo_flags
hexdump -C $PGDATA/base/13758/16441 -s 16380 -n 2
[postgres@node1 ~]$ hexdump -C $PGDATA/base/13758/16441 -s 16380 -n 2
00003ffc 03 00 |..|
00003ffe
6 btpo_cycleid
hexdump -C $PGDATA/base/13758/16441 -s 16382 -n 2
[postgres@node1 ~]$ hexdump -C $PGDATA/base/13758/16441 -s 16382 -n 2
00003ffa 00 00 |..|
00003ffc
索引有序性
我们都知道,B-Tree索引是有序的,下面我们看看在物理存储结构上如何保证有序性。
插入数据,id=18
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/DB0E5C98 | 0 | 0 | 40 | 8112 | 8176 | 8192 | 4 | 0
(1 row)
testdb=# -- 插入数据,id=18
testdb=# insert into t_index values(18,'4','d');
INSERT 0 1
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/DB0E6498 | 0 | 0 | 44 | 8096 | 8176 | 8192 | 4 | 0
(1 row)
-- dump索引页
[xdb@localhost utf8db]$ hexdump -C $PGDATA/base/13758/16441 -s 8192
00002000 01 00 00 00 98 64 0e db 00 00 00 00 2c 00 a0 1f |.....d......,...|
00002010 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00 |... ...... ... .|
00002020 c0 9f 20 00 b0 9f 20 00 a0 9f 20 00 00 00 00 00 |.. ... ... .....|
00002030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00003fa0 00 00 00 00 05 00 10 00 12 00 00 00 00 00 00 00 |................|
00003fb0 00 00 00 00 04 00 10 00 10 00 00 00 00 00 00 00 |................|
00003fc0 00 00 00 00 03 00 10 00 08 00 00 00 00 00 00 00 |................|
00003fd0 00 00 00 00 02 00 10 00 04 00 00 00 00 00 00 00 |................|
00003fe0 00 00 00 00 01 00 10 00 02 00 00 00 00 00 00 00 |................|
00003ff0 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 |................|
00004000
插入数据,id=17
testdb=# -- 插入数据,id=17
testdb=# insert into t_index values(17,'4','d');
INSERT 0 1
testdb=# checkpoint;
CHECKPOINT
testdb=# -- 查看索引数据页头数据
testdb=# select * from page_header(get_raw_page('pk_t_index',1));
lsn | checksum | flags | lower | upper | special | pagesize | version | prune_xid
------------+----------+-------+-------+-------+---------+----------+---------+-----------
1/DB0E6808 | 0 | 0 | 48 | 8080 | 8176 | 8192 | 4 | 0
(1 row)
-- dump索引页
[xdb@localhost utf8db]$ hexdump -C base/16477/26637 -s 8192
00002000 01 00 00 00 08 68 0e db 00 00 00 00 30 00 90 1f |.....h......0...|
00002010 f0 1f 04 20 00 00 00 00 e0 9f 20 00 d0 9f 20 00 |... ...... ... .|
00002020 c0 9f 20 00 b0 9f 20 00 90 9f 20 00 a0 9f 20 00 |.. ... ... ... .|
00002030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00003f90 00 00 00 00 06 00 10 00 11 00 00 00 00 00 00 00 |................|
00003fa0 00 00 00 00 05 00 10 00 12 00 00 00 00 00 00 00 |................|
00003fb0 00 00 00 00 04 00 10 00 10 00 00 00 00 00 00 00 |................|
00003fc0 00 00 00 00 03 00 10 00 08 00 00 00 00 00 00 00 |................|
00003fd0 00 00 00 00 02 00 10 00 04 00 00 00 00 00 00 00 |................|
00003fe0 00 00 00 00 01 00 10 00 02 00 00 00 00 00 00 00 |................|
00003ff0 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 |................|
00004000
索引的数据区,并没有按照大小顺序排序,\x11(17)在\x12(18)的后面(从尾部开始往前),但在索引页的头部ItemId区域是有序的,第5个ItemId(\x00209f90)指向的是17,而第6个ItemId(\x00209fa0)指向的是18,通过索引数据指针的有序保证索引有序性。
0.1 四、小结
小结一下,知识要点:
1、Special Space:介绍了索引PageHeaderData的Special Space的存储结构,通过Special Space可以获得B-Tree的root、左右sibling等信息;
2、有序性:索引Page通过索引项指针保证有序性。